Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


columns in a different data frame called “countdata” and then we need to add column
names and row names to that count data frame. The row names can be the transcript IDs
and the column names can be the sample IDs as listed in the sample info file. The sample
IDs in the sample info file must be in the same order as the columns of the read counts in
the read count file.
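The ordering requirement can be checked before any names are assigned. The following is a minimal sketch on made-up objects: `counts_toy`, `info_toy`, the `.bam` column labels, and the sample IDs are all hypothetical and are not the chapter's data. It assumes the count columns carry some label that can be compared with the sample IDs; when they carry none, the order can only be confirmed from how the count file was produced.

```r
# Toy count table whose columns are labelled after the files they came from
# (hypothetical names, for illustration only).
counts_toy <- data.frame(
  s2.bam = c(5, 0, 12),
  s1.bam = c(3, 0, 9),
  s3.bam = c(7, 0, 1)
)

# Toy sample sheet listing the same samples in a different order.
info_toy <- data.frame(
  sampleid  = c("s1", "s2", "s3"),
  condition = c("control", "control", "treated"),
  stringsAsFactors = FALSE
)

# Strip the file extension so the column labels are comparable to the IDs.
col_ids <- sub("\\.bam$", "", colnames(counts_toy))

# TRUE only when both tables list the samples in the same order.
same_order <- identical(col_ids, info_toy$sampleid)

# If the order differs, reorder the sample sheet to follow the count columns.
info_sorted <- info_toy[match(col_ids, info_toy$sampleid), ]
rownames(info_sorted) <- NULL
```

Reordering the sample sheet, rather than the count columns, leaves the count table untouched, so the sample IDs can afterwards be assigned as column names directly from `info_sorted$sampleid`.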

countdata0 <- seqdata[,-(1:2)]

head(countdata0)

The “head” function displays the first six rows of the “countdata0” data frame. Note
that, as shown in Figure 5.4, the data frame has no row names and the column names do
not indicate the sample names. Note also that numerous rows are zero in every column.
This is mainly because we aligned reads to chromosome 22 only.

The second step is to use the gene symbols as the row names and the sample IDs as the
column names of the “countdata0” data frame, and to remove the rows that are zero for
all samples (Figure 5.5).

rownames(countdata0) <- seqdata[,1]

colnames(countdata0) <- sampleinfo$sampleid

countdata <- countdata0[rowSums(countdata0) > 0, ]

head(countdata)
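The effect of these steps is easier to see on a table small enough to read in full. The sketch below repeats them on a made-up data frame; `seq_toy`, its column names, the gene identifiers, and the sample IDs in `ids_toy` are hypothetical stand-ins, chosen only so that two of the four rows are zero for every sample.

```r
# Toy stand-in for the read count table: two annotation columns followed by
# three count columns (all values are made up).
seq_toy <- data.frame(
  id     = c("geneA", "geneB", "geneC", "geneD"),
  length = c(1200, 800, 950, 2100),
  c1     = c(10, 0, 4, 0),
  c2     = c(7, 0, 0, 0),
  c3     = c(12, 0, 6, 0),
  stringsAsFactors = FALSE
)
ids_toy <- c("s1", "s2", "s3")   # hypothetical sample IDs

cnt <- seq_toy[, -(1:2)]         # drop the two annotation columns
rownames(cnt) <- seq_toy[, 1]    # identifiers become row names
colnames(cnt) <- ids_toy         # sample IDs become column names

keep <- rowSums(cnt) > 0         # TRUE for rows with at least one read
cnt_filtered <- cnt[keep, ]

dim(cnt)            # 4 rows, 3 columns before filtering
dim(cnt_filtered)   # 2 rows, 3 columns after filtering
```

Storing the logical vector in `keep` before subsetting makes it easy to report how many rows were dropped, for example with `sum(!keep)`.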

After creating a count data frame as in Figure 5.5, the next step is to create a DGEList
object to hold the read counts that will be analyzed by edgeR. The DGEList object is a
container for the count data and the associated metadata, including sample names, sample
group, library size, and normalization factors. The DGEList for our example data is created
by the following:

group <- factor(sampleinfo$condition)

y <- DGEList(countdata, group=group)
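As a rough picture of the sample-level metadata such a container carries, the base-R sketch below builds a table with one row per sample. It is not edgeR code and not how DGEList is implemented. It assumes, in line with DGEList's defaults, that the library size of a sample is the sum of its column of counts and that every normalization factor starts at 1. The toy counts, gene names, and sample names are made up.

```r
# Made-up counts: rows are genes, columns are samples.
cnt_toy <- matrix(
  c(10, 7, 12,
     4, 0,  6),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneC"), c("s1", "s2", "s3"))
)
group_toy <- factor(c("control", "control", "treated"))

# One row per sample: group, library size, and normalization factor.
samples_toy <- data.frame(
  group        = group_toy,
  lib.size     = colSums(cnt_toy),        # total reads per sample
  norm.factors = rep(1, ncol(cnt_toy)),   # starting value before normalization
  row.names    = colnames(cnt_toy)
)
samples_toy
```

With edgeR loaded and the object created as above, the corresponding table of the real object can be viewed with `y$samples`.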

Figure 5.6 shows the created DGEList object. At this time, it holds two components,
counts and samples, which can be accessed as y$counts and y$samples. The counts
component contains the count data and the samples component contains the sample

FIGURE 5.4 The count data frame without row names and column names.